The Foundations of Multi-Armed Bandit Problems
AI029 Lesson 2
00:00

Welcome to the ultimate arena of decision-making under uncertainty. Imagine you are in a casino, facing a row of slot machines: the classic n-armed bandit problem. This is the fundamental nonassociative setting of reinforcement learning, where we strip away the complexity of changing environments to focus on one burning question: How do we choose the best action when we don't know the rules?

[Diagram: the agent-environment interaction loop. The agent sends action A_t to the environment; the environment returns reward R_{t+1} and state S_t, for t = 0, 1, 2, 3, ...]

The Interaction Framework

Reinforcement learning is a considerable abstraction of goal-directed learning from interaction. At each time step $t = 0, 1, 2, \dots$, the agent perceives a state $S_t \in \mathcal{S}$, selects an action $A_t \in \mathcal{A}(S_t)$, and receives a reward $R_{t+1} \in \mathcal{R}$. In the bandit problem there is effectively only one state, so the agent can focus entirely on learning a good action-selection policy through pure interaction.
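This loop can be sketched in a few lines of code. The setup below is a hypothetical 10-armed testbed (the arm count, reward distributions, and the uniform-random placeholder policy are illustrative assumptions, not prescribed by the lesson): each arm pays a Gaussian reward centered on a hidden true value, and since there is only one state, the environment step takes just the action.

```python
import random

random.seed(0)
k = 10
# Hidden true action values q*(a) -- the agent never sees these directly.
q_star = [random.gauss(0, 1) for _ in range(k)]

def pull(a):
    """Environment step: return reward R_{t+1} for action A_t = a (no state)."""
    return random.gauss(q_star[a], 1)

# The interaction loop from the text, with a placeholder random policy.
rewards = []
for t in range(1000):
    a = random.randrange(k)   # A_t: here chosen uniformly at random
    r = pull(a)               # R_{t+1}: evaluative feedback, just a score
    rewards.append(r)

print(sum(rewards) / len(rewards))
```

A uniform-random policy earns roughly the average of the true arm values; the rest of the lesson is about doing better than that.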

Paradigm              Feedback Type                      Learning Mechanism
Supervised Learning   Instructive (the "right" answer)   Pattern matching
Bandit Problems       Evaluative (a score)               Trial-and-error search

The Exploration-Exploitation Dilemma

Because the agent is never told which action is optimal, it faces a fundamental conflict. It must exploit what it already knows to secure immediate reward, but it must also explore less-tried actions that might yield even higher returns in the future. This tension distinguishes the bandit problem from static optimization and is the heartbeat of adaptive intelligence.
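One standard way to balance the two sides of this dilemma is epsilon-greedy action selection. The sketch below assumes a 10-armed Gaussian testbed and an epsilon of 0.1 (both illustrative choices): with probability epsilon the agent explores a random arm, otherwise it exploits the arm with the highest estimated value, updating estimates as incremental sample averages.

```python
import random

random.seed(1)
k, epsilon, steps = 10, 0.1, 5000
q_star = [random.gauss(0, 1) for _ in range(k)]   # hidden true action values
Q = [0.0] * k                                     # estimated values
N = [0] * k                                       # times each arm was pulled

for t in range(steps):
    if random.random() < epsilon:
        a = random.randrange(k)                   # explore: random arm
    else:
        a = max(range(k), key=lambda i: Q[i])     # exploit: current best estimate
    r = random.gauss(q_star[a], 1)                # evaluative feedback
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                     # incremental sample-average update

print("pull counts:", N)
```

After enough steps, the pull counts concentrate on the high-value arms while exploration keeps every estimate from going stale, which is exactly the trade-off the dilemma describes.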